Recap
Inference
Least Squares as a Statistical Model
So far, we…
…start with a vector \(\mathbf{Y} = (Y_1 \ldots Y_i \ldots Y_n)\)
… want to produce a vector of predicted values \(\mathbf{\widehat{Y}} = (\widehat{Y_1} \ldots \widehat{Y_i} \ldots \widehat{Y_n})\)
… define \(\mathbf{\widehat{Y}} = \mathbf{X}\mathbf{\beta}\):
Then, we…
… choose \(\mathbf{\widehat{Y}}\) to have minimum (Euclidean) distance to \(\mathbf{Y}\):
… find that the squared length \(\mathbf{e}'\mathbf{e}\) is minimized when \(\mathbf{e}'\mathbf{X} = \mathbf{0}\): residuals are orthogonal to \(\mathbf{X}\).
This gave us \(\widehat{\mathbf{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\)
This definition of least squares means that:
residuals \(\mathbf{e}\) are orthogonal to \(\mathbf{X}\); so if there is an intercept, the mean of the residuals is \(0\), and the sample covariance between the residuals and each column of \(\mathbf{X}\) is \(0\).
If we just fit an intercept (same value of \(\widehat{Y}\) for all cases): what is the mean of the residuals \(E(Y_i - \widehat{Y_i})\)?
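The course code is in R, but the algebra above is language-neutral. Here is a minimal sketch in Python/NumPy with made-up data: it computes \(\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\) and checks that the residuals are orthogonal to \(\mathbf{X}\) (and hence have mean zero, since there is an intercept).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), x])

# beta_hat = (X'X)^{-1} X'Y (np.linalg.solve is more stable than an explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat

# Residuals are orthogonal to every column of X ...
print(np.allclose(X.T @ e, 0))   # True
# ... so with an intercept column, the residuals have mean zero
print(np.isclose(e.mean(), 0))   # True
```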
Let’s say we want to get the conditional expectation function for yearly earnings of professionals as a function of hours worked, profession, gender, and age.
Using data from the American Community Survey, we can look at how hours worked is related to earnings for doctors and lawyers.
UHRSWORK: usual hours worked per week
FEMALE: an indicator for female
AGE: age in years
LAW: an indicator for being a lawyer
INCEARN: total annual income earned, in dollars
Let’s now find the (approximate) linear conditional expectation function of earnings, across hours worked, age, profession, and gender:
##
## Call:
## lm(formula = INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, data = acs_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -291485 -89191 -28271 71281 1067525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9929.8 9370.4 1.06 0.289
## UHRSWORK 1245.4 111.9 11.13 <2e-16 ***
## AGE 3311.3 125.0 26.50 <2e-16 ***
## LAW -49672.0 2647.1 -18.76 <2e-16 ***
## FEMALE -42281.4 2811.5 -15.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 127300 on 9995 degrees of freedom
## Multiple R-squared: 0.1502, Adjusted R-squared: 0.1498
## F-statistic: 441.6 on 4 and 9995 DF, p-value: < 2.2e-16
How do we make sense of, e.g., the slope on AGE?
In your small groups:
could we do better at approximating the conditional expectation function, e.g., of hours worked by age?
How would you do it?
How would this change the interpretation of the coefficient on UHRSWORK?
ls = lm(INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, acs_data)
We want to interpret the \(\beta\) vector causally.
There is a true \(\beta_D\) (\(\delta\)) that is the effect of some \(D\) on \(Y\).
\(\beta_D\) is an unknown parameter.
To estimate \(\beta_D\), we need a statistical model for least squares.
As with experiments (e.g. the Neyman causal model), statistical models encode assumptions about the process generating the data.
We need to clarify what work is being done by these assumptions.
Hooke’s law is a law of physics that states that the force (\(D\)) needed to extend or compress a spring by some distance \(y\) increases linearly with respect to the distance.
Thus, for a given weight \(D_i\) on a spring, the length \(Y_i\) is:
\[Y_i = \alpha + \beta D_i + \epsilon_i\] for \(i\) in \(1 \ldots n\)
This is a statistical model:
\(\epsilon_i\) is a random error (a random variable). Each \(\epsilon_i\) is drawn independently from the same distribution, with mean \(0\) and variance \(\sigma^2\).
We use observed data to estimate the parameters \(\alpha, \beta, \sigma^2\)
Imagine we have a spring whose length is 30 cm. With each Newton of force applied, the spring increases by 0.5 cm.
Using these facts and the statistical model described above: create a response schedule/potential outcomes table corresponding to the spring under loads of \(0,1,2,3,5,8\) Newtons.
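One way to see what the statistical model adds is to simulate it. The sketch below (in Python/NumPy rather than the course's R) uses the spring from the slide (\(\alpha = 30\) cm, \(\beta = 0.5\) cm per Newton) and the stated loads; the noise level \(\sigma = 0.05\) cm is an assumed value, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

alpha, beta, sigma = 30.0, 0.5, 0.05            # sigma: assumed measurement noise (cm)
loads = np.array([0.0, 1.0, 2.0, 3.0, 5.0, 8.0])  # loads D_i in Newtons

# The statistical model: Y_i = alpha + beta * D_i + eps_i
eps = rng.normal(0.0, sigma, size=loads.size)
lengths = alpha + beta * loads + eps

# Recover alpha and beta by least squares
X = np.column_stack([np.ones(loads.size), loads])
ahat, bhat = np.linalg.solve(X.T @ X, X.T @ lengths)
print(ahat, bhat)   # close to 30 and 0.5
```

Rerunning with a fresh seed redraws the \(\epsilon_i\), giving a new row of the response schedule each time.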
The statistical model for the regression makes certain assumptions about the process generating the data.
The statistical model assumes that this linear equation describes the process generating the data.
\[Y_i = \alpha + \beta D_i + \epsilon_i\]
deterministic (not random): \(\alpha + \beta D_i\)
stochastic (random error): \(\epsilon_i\)
\(\alpha\): parameter for the intercept (fixed for all cases)
\(\beta\): parameter for the slope (fixed for all cases)
\(D_i\): value of independent variable (input into the equation)
\(Y_i\): value of dependent variable (determined by the equation)
\(\epsilon_i\): random error for observation \(i\)
The more general regression model (where \(\mathbf{X_i}\) is a row vector of variables for case \(i\))
\[Y_i = \mathbf{X_i\beta} + \epsilon_i\]
can be estimated with least squares:
\[Y_i = X_i\hat{\beta} + e_i\]
where \(Var(e) = \widehat{\sigma^2}\)
Differences from Least Squares Algorithm
Least Squares is the Best Linear Unbiased Estimator of our model parameters under 3 key assumptions.
\((1)\) Model equation is correctly specified
In groups: come up with 2 ways this assumption could be wrong in the context of this regression: lm(INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, acs_data)
\((2)\) \(\epsilon_i\) are independent, identically distributed (i.i.d.)
In groups: come up with 2 ways this assumption could be wrong in the context of the Hooke’s Law example.
\((3)\) \(\epsilon_i\) is independent of \(X_i\): \(X_i \perp \!\!\! \perp \epsilon_i\)
In groups: come up with 2 ways this assumption could be wrong in the context of this regression: lm(INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, acs_data)
Under these assumptions we call the model “ordinary least squares”
We cannot empirically verify these assumptions.
In particular, we cannot verify assumption 3, because \(\epsilon_i\) is never observed.
If these assumptions are reasonable for the data we observe, then with large \(n\), we can estimate our parameters \(\beta, \sigma^2\) using least squares.
Mathematical Assumptions (least squares):
Statistical Assumptions (statistical model):
If \(\mathbf{X}\) is \(n \times p\) matrix: We want to estimate \(p + 1\) parameters… (what are they?)
\(\beta\) estimated by \(\widehat{\beta}\) using least squares:
\[\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\]
Under the statistical assumption we made, \(\widehat{\beta}\) is an unbiased estimator of \(\beta\), conditional on \(X\): \(E(\widehat{\beta}|X) = \beta\)
By assumption: \(Y = \mathbf{X\beta} + \epsilon\), so:
\(\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\)
\(\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X\beta} + \epsilon)\)
\(\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\)
\(\widehat{\beta} = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\)
\(\widehat{\beta} = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\)
So:
\(E(\widehat{\beta}|X) = E(\mathbf{\beta}|\mathbf{X}) + E((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon|\mathbf{X})\)
\(E(\widehat{\beta}|X) = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon|\mathbf{X})\)
And because, by assumption, \(X \perp \!\!\! \perp \epsilon\)
\(E(\widehat{\beta}|X) = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon)\)
And because, by assumption, \(E(\epsilon) = 0\)
\(E(\widehat{\beta}|X) = \mathbf{\beta}\)
Thus, if our model assumptions are correct, \(\widehat{\beta}\) is an unbiased estimate of parameter \(\beta\)
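The unbiasedness result can be checked by simulation. This is a Python/NumPy sketch (not the course's R), with a made-up design: we fix one \(\mathbf{X}\) (conditioning on \(\mathbf{X}\)) and redraw \(\epsilon\) many times; the average of \(\widehat{\beta}\) across draws should be close to the true \(\beta\).

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 50, 2000
beta = np.array([1.0, -2.0])     # true parameters: intercept and slope

# Fix one design matrix X (we condition on X) and redraw epsilon each replication
X = np.column_stack([np.ones(n), rng.normal(size=n)])
XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)   # (X'X)^{-1} X'

estimates = np.empty((reps, 2))
for r in range(reps):
    eps = rng.normal(size=n)     # iid, mean 0, independent of X
    y = X @ beta + eps
    estimates[r] = XtX_inv_Xt @ y

print(estimates.mean(axis=0))    # close to (1, -2)
```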
We need a way of quantifying the uncertainty in our estimate \(\widehat{\beta}\): its variance-covariance matrix.
\[\scriptsize{\begin{pmatrix} Var(\widehat{\beta}_1) & Cov(\widehat{\beta}_2,\widehat{\beta}_1) & Cov(\widehat{\beta}_3,\widehat{\beta}_1) &\ldots & Cov(\widehat{\beta}_p,\widehat{\beta}_1) \\ Cov(\widehat{\beta}_1,\widehat{\beta}_2) & Var(\widehat{\beta}_2) & Cov(\widehat{\beta}_3,\widehat{\beta}_2) & \ldots & Cov(\widehat{\beta}_p,\widehat{\beta}_2) \\ Cov(\widehat{\beta}_1,\widehat{\beta}_3) & Cov(\widehat{\beta}_2,\widehat{\beta}_3) & Var(\widehat{\beta}_3) & \ldots & Cov(\widehat{\beta}_p,\widehat{\beta}_3) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ Cov(\widehat{\beta}_1,\widehat{\beta}_p) & Cov(\widehat{\beta}_2,\widehat{\beta}_p) & Cov(\widehat{\beta}_3, \widehat{\beta}_p) & \ldots & Var(\widehat{\beta}_p)\end{pmatrix}}\]
How do we use the variance-covariance matrix?
The square root of each diagonal element (a variance) gives the standard error of the corresponding estimate in \(\widehat{\beta}\) (hypothesis testing).
The off-diagonal elements help answer questions like: is \(\beta_2 + \beta_3 \neq 0\)? We need \(Cov(\widehat{\beta}_2, \widehat{\beta}_3)\) to get \(Var(\widehat{\beta}_2 + \widehat{\beta}_3)\) (e.g., for interaction effects).
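A tiny numeric illustration of the second point, using a hypothetical \(2 \times 2\) block of the variance-covariance matrix (the numbers are made up): \(Var(\widehat{\beta}_2 + \widehat{\beta}_3) = Var(\widehat{\beta}_2) + Var(\widehat{\beta}_3) + 2\,Cov(\widehat{\beta}_2, \widehat{\beta}_3)\), which is the quadratic form \(\mathbf{a}'\mathbf{V}\mathbf{a}\) with \(\mathbf{a} = (1, 1)'\).

```python
import numpy as np

# A hypothetical 2x2 block of vcov(beta_hat) for (beta_2, beta_3)
V = np.array([[ 4.0, -1.5],
              [-1.5,  9.0]])

# Var(beta_2 + beta_3) = Var(beta_2) + Var(beta_3) + 2 Cov(beta_2, beta_3)
a = np.array([1.0, 1.0])
var_sum = a @ V @ a
print(var_sum)   # 4 + 9 + 2*(-1.5) = 10.0
```

Note the negative covariance shrinks the variance of the sum below \(Var(\widehat{\beta}_2) + Var(\widehat{\beta}_3)\); ignoring the off-diagonal term would overstate the uncertainty here.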
\(\widehat{\beta} = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\), So:
\[cov(\widehat{\beta}|X) = E((\widehat{\beta} - \beta)(\widehat{\beta} - \beta)' | X)\]
\[ = E( ((\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\epsilon)((\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\epsilon)' | X)\]
\[ = E( ((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon)(\epsilon'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}) | X)\]
\[ = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon\epsilon'|X)\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]
What really matters here is: \(E(\epsilon\epsilon'|X)\)
All variance-covariance matrices are “sandwiches”:
\[ (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon\epsilon'|X)\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]
\((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\) is the bread; \(E(\epsilon\epsilon'|X)\) is the meat. We can make different choices of “meat”.
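To make the "different meats" concrete, here is a Python/NumPy sketch (the course's code is in R) comparing two choices on simulated data: the homoskedastic meat \(\sigma^2\mathbf{I}\) derived below, and the heteroskedasticity-robust (HC0) meat \(\mathrm{diag}(e_i^2)\), a standard alternative that is not derived in these slides.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat

bread = np.linalg.inv(X.T @ X) @ X.T     # the "bread": (X'X)^{-1} X'

# Homoskedastic "meat": E(ee'|X) = sigma^2 I, estimated by sigma2_hat * I
sigma2_hat = (e @ e) / (n - 2)
V_classic = sigma2_hat * np.linalg.inv(X.T @ X)

# Robust (HC0) "meat": replace sigma^2 I with diag(e_i^2)
V_robust = bread @ np.diag(e**2) @ bread.T

print(np.sqrt(np.diag(V_classic)))       # classic standard errors
print(np.sqrt(np.diag(V_robust)))        # robust standard errors
```

Because the simulated errors really are homoskedastic here, the two sets of standard errors should be similar; with heteroskedastic errors they would diverge.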
In groups: if \(\epsilon\) is a vector that is \(n \times 1\) with elements \((\epsilon_1 \ldots \epsilon_n)\)
What are the dimensions of the matrix: \(\epsilon\epsilon'\)?
What are the elements on the diagonal of the matrix?
What is the expected value of the elements on the off-diagonal?
Hints: We have assumed that \(E(\epsilon) = 0\); \(\epsilon_i\) are independent of each other.
\[E(\epsilon\epsilon'|X) = \sigma^2\mathbf{I}_{n\times n}\]
\(\epsilon\epsilon'\) is \(n \times n\) matrix with \(ij\)th elements \(\epsilon_i\epsilon_j\)
Taking the expectation:
\[ = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\sigma^2\mathbf{I}_{n\times n}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]
\[ = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\] \[cov(\widehat{\beta}|X) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\]
This gives us the variance-covariance matrix of \(\widehat{\beta}\). Can we start computing standard errors for \(\widehat{\beta}\) now?
We use residuals to stand in for \(\epsilon\).
\[\widehat{\sigma^2} = \frac{\sum\limits_i^n e_i^2}{n - p}\]
\(E[\widehat{\sigma^2} | X] = \sigma^2\) (unbiased)
so, we estimate the variance-covariance matrix:
\(\widehat{cov}(\widehat{\beta}) = \widehat{\sigma^2}(\mathbf{X}'\mathbf{X})^{-1}\)
The variance of a slope estimate \(\hat{\beta_p}\) (the corresponding diagonal element of \(\widehat{cov}(\widehat{\beta})\)) is
\(Var(\hat{\beta_p}) = \frac{\hat{\sigma^2}}{\sum\limits_i^n (X^*_{i} - \overline{X^*})^2}\)
So the standard error is:
\(SE(\hat{\beta_p}) = \sqrt{Var(\hat{\beta_p})} = \frac{\widehat{\sigma}}{\sqrt{n}\sqrt{Var(X_p^*)}}\)
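The steps above can be sketched numerically (in Python/NumPy rather than the course's R, with made-up data): estimate \(\widehat{\sigma^2} = \sum e_i^2 / (n - p)\), form \(\widehat{cov}(\widehat{\beta}) = \widehat{\sigma^2}(\mathbf{X}'\mathbf{X})^{-1}\), and check that the slope's standard error matches the scalar formula \(\widehat{\sigma} / (\sqrt{n}\sqrt{Var(X^*)})\).

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 2
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat

sigma2_hat = (e @ e) / (n - p)               # unbiased estimate of sigma^2
V = sigma2_hat * np.linalg.inv(X.T @ X)      # estimated vcov of beta_hat
se_slope = np.sqrt(V[1, 1])

# Same thing via the scalar formula: sigma_hat / (sqrt(n) * sqrt(Var(x)))
se_formula = np.sqrt(sigma2_hat) / (np.sqrt(n) * x.std())  # x.std() uses ddof=0
print(np.isclose(se_slope, se_formula))      # True
```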
Asymptotic normality:
Even if \(\epsilon\) is not normally distributed: under our assumptions, as \(n \to \infty\) the sampling distribution of \(\hat{\beta}\) is normal by the central limit theorem. This approximation is reasonable if \(n\) is moderately large and the data are not highly skewed.
Because we estimate the standard errors, the \(t\) statistic for hypothesis testing of \(\widehat{\beta}\) is asymptotically \(t\) distributed:
In sum:
For standard errors of \(\widehat{\beta}\) to be correct, we must assume:
If any assumption is wrong: standard errors will be incorrect.
If assumption 1 or assumption 3 is wrong: unbiasedness of \(\widehat{\beta}\) does not hold.
In groups:
What is an example of the assumption being wrong?
Why would it lead our standard errors to be wrong?
When do things go wrong?
With these assumptions: